Final Report

DS 4010 - Wildfire Dashboard
Authors
Affiliation

Dhruv Dole

Iowa State University

Nathan Rethwisch

Iowa State University

Colin Russell

Iowa State University

Thanh Mai

Iowa State University

Introduction/Project Description

Wildfires pose a significant challenge in the United States, with annual costs estimated between $394 billion and $893 billion, according to the Joint Economic Committee (U.S. Congress Joint Economic Committee 2023). Understanding the locations and causes of wildfires is essential for effective prevention, helping to safeguard lives, protect property, and reduce economic impact. This project focuses on estimating areas of elevated fire risk using historical weather patterns and wildfire data. The resulting dashboard provides access to information about past weather conditions and their connection with wildfire activity. Designed for a wide range of users, the dashboard serves aims to serve both high-level policymakers involved in wildfire prevention and local officials tasked with creating region-specific regulations.

Figure 1: Estimated Cost of Wildfires in 2022

Data

Sources & Acquisition

Global Historical Climatology Network daily (GHCND)

The Global Historical Climatology Network daily (GHCND) is an integrated database of daily climate summaries from land surface stations worldwide. It includes records from over 100,000 stations in 180 countries, providing data on variables such as temperature, precipitation, snowfall, and snow depth. This data is provided in its raw form by the National Oceanic and Atmospheric Administration here. The same data is available in different formats published as an AWS Open Data dataset. The latter dataset was selected because it provides data as comma-separated-values instead of fixed-width tables.

There are two raw ‘tables’ available from the dataset: stations and daily. stations is available as a single fixed-width text file, and provides a mapping of weather station IDs and geographic coordinates. daily is made up of csv records, where each record corresponds to a single observation made by a station. The data is partitioned by year, with a single compressed csv for each year. Source data exists for the year range of 1776 to 2025, however before the year 2000 there is a comparatively small amount of data. It was decided to focus on data from 2000 onwards.

In order to create cleaned tables, it was necessary first to mirror the raw data locally to avoid the need for repeated downloads of the same files. A python script was written which parses the s3 bucket manifest and downloads all daily files with the csv.gz suffix for years after \(1999\). The script also downloads a copy of the file stations.txt, the raw source for stations.

First 5 lines of daily file 2010.csv.gz
AE000041196,20100101,TMAX,259,,,S,
AE000041196,20100101,TMIN,120,,,S,
AE000041196,20100101,TAVG,181,H,,S,
AEM00041194,20100101,TMAX,250,,,S,
AEM00041194,20100101,TMIN,168,,,S,
First 5 lines of stations.txt
ACW00011604  17.1167  -61.7833   10.1    ST JOHNS COOLIDGE FLD                       
ACW00011647  17.1333  -61.7833   19.2    ST JOHNS                                    
AE000041196  25.3330   55.5170   34.0    SHARJAH INTER. AIRP            GSN     41196
AEM00041194  25.2550   55.3640   10.4    DUBAI INTL                             41194
AEM00041217  24.4330   54.6510   26.8    ABU DHABI INTL                         41217

National US Forest Service Fire Occurence Point

The FireOccurrence point layer represents ignition points, or points of origin, from which individual USFS wildland fires started. Data are maintained at the Forest/District level, or their equivalent, to track the occurrence and the origin of individual USFS wildland fires. It includes variables such as Discovery Date, Total Acres, Stat Cause, LatitudeDD83, LongitudeDD83 and much more. All of the information is publicly avaliable here. There are multiple formats that the data can be downloaded in, but our group used a csv file.

Transformations

GHCND

The transformations for the GHCN Daily data is a single script because the required transformations were changing almost every day, and it was often necessary to rebuild the dataset outright. This also allows for anyone to reproduce the dataset.

The raw daily data includes one column designating the type of observation, and one the actual value recorded. The desired end result was a single table, with a column for each desired observation type or element. Once joined with station locations, the data was indexed using the h3 library. This has many benefits which are described below

The basic steps are as follows: stations.txt needs to be converted from a fixed-width text file to a typed parquet file. Each daily file needs to be decompressed, typed, and reoriented into the correct format.

The transformation steps are shown in this diagram:

graph TD
    A[stations.txt] --> D
    D[read fwf] --> H

    C[raw daily] --> B[decompress]
    B --> E[process dates]
    E --> F[drop unwanted columns]
    F --> G[create single element tables]
    G --> I[join element tables]
    I --> J[join with stations]
    J --> K[generate point geometries]
    K --> L[index with h3]
    L --> H

    H[save as parquet]

    M[read stations.parquet] --> J

Once the data was in a viewable format, columns to drop were chosen based by generating percentage nulls. Each time it was decided to keep or drop a column, the column was added or removed from a keep_columns list, and the pipeline was re-run. See Table 1 for the resultant table.

Table 1: First 5 rows of 2010.parquet
station_id year month day prcp snow snwd tmax tmin awnd awdr evap latitude longitude elevation state geometry Hexagon_ID
CA001011500 2010 1 1 245 0 0 85 35 NA NA NA 48.9333 -123.75 75 BC 0x0101000000ACADD85F767748400000000000F05EC0 8328dcfffffffff
CA001011500 2010 1 2 2 0 0 100 55 NA NA NA 48.9333 -123.75 75 BC 0x0101000000ACADD85F767748400000000000F05EC0 8328dcfffffffff
CA001011500 2010 1 3 78 0 0 70 30 NA NA NA 48.9333 -123.75 75 BC 0x0101000000ACADD85F767748400000000000F05EC0 8328dcfffffffff
CA001011500 2010 1 4 205 0 0 65 50 NA NA NA 48.9333 -123.75 75 BC 0x0101000000ACADD85F767748400000000000F05EC0 8328dcfffffffff
CA001011500 2010 1 5 6 0 0 80 50 NA NA NA 48.9333 -123.75 75 BC 0x0101000000ACADD85F767748400000000000F05EC0 8328dcfffffffff

National US Forest Service Fire Occurence Point

The transformations to the fire occurrence dataset were done using R. The major issue with the dataset was the large number of null values; some of the 44 variables were unusable due to the amount of missing data. The variables included were UNIQFIREID, FIREYEAR, DISCOVERYDATETIME, FIREOUTDATETIME, SIZECLASS, TOTALACRES, STATCAUSE, LATDD83, and LONGDD83. Most of these variables only had NA/null values that needed to be removed. The hardest part was attempting to clean the Stat Cause column because of the sheer amount of mislabeled causes of fire. After dealing with these issues the dataset was reduced from 580,000 rows and 35 variables to 245,000 rows and 10 variables. Lastly, the finalized dataset was exported as a parquet file to preserve variable types and remove the need to repeatedly run the cleaning script.

The transformation steps are shown in this diagram:

  graph TD
    A[Read CSV] --> B[Select relevant columns]
    B --> C[Remove NA values]
    C --> D[Clean FIREYEAR ]
    D --> E[Verify SIZECLASS vs TOTALACRES]
    E --> F[Sample and inspect LAT/LONG]
    F --> G[Clean STATCAUSE]
    G --> H[Export as a parquet file]

First 5 lines of Fire Occurence
X Y OBJECTID GLOBALID FIREOCCURID REVDATE FIRENAME FIREYEAR UNIQFIREID DISCOVERYDATETIME SIZECLASS TOTALACRES STATCAUSE
-106.4278 39.84611 231055009 {540C6E70-51FD-4CCC-A1B3-8A6C914E34A3} 2B1B9A1B-7828-4688-A14C-4BD7CEE62226 2023/03/29 11:10:56+00 Elliott Ridge 2016 2016-COWRF-000679 2016/10/22 00:00:01+00 A 0.25 Camping
-107.9356 39.37278 231055010 {ACE77CF8-1281-4A81-BE3D-4D55669CB134} A2454D64-EBB7-4895-81BC-9782B3D1391E 2023/03/29 11:10:59+00 Battlement Mesa Reservoir 2016 2016-COWRF-000452 2016/08/07 00:00:01+00 A 0.10 Camping
-106.7233 39.11833 231055011 {D8EA7329-A0D0-4B64-A6F2-7A7FD86A3505} 2023/03/29 11:10:56+00 Difficult 1997 1997-COWRF-000 1997/07/26 00:00:01+00 A 0.10 Lightning
-106.5900 39.65667 231055012 {67A72390-0E2B-411E-9CDE-6657101645E4} 2023/03/29 11:10:56+00 1993 1993-COWRF-000 1993/08/23 00:00:01+00 A 0.10 Undetermined
-107.3173 40.01929 231055013 {DFFF8D93-DE89-451C-830C-1F65FD550B3D} 558D5D61-AB04-4AC3-8143-2E0EBE7CB506 2024/03/27 14:21:08+00 Paradise 2020 2020-COWRF-000504 2020/10/14 13:16:01+00 A 0.10 Camping
-106.5183 39.58833 231055014 {B7FC6338-897C-44AC-BD8B-7D1EB63EC7D8} 2023/03/29 11:10:56+00 B.C. 1994 1994-COWRF-000 1994/07/29 00:00:01+00 A 0.10 Lightning

Storage and Query Engine

To handle selections, projections, and group by aggregations on larger-than-memory datasets, PyArrow and Parquet were chosen for their support of lazy evaluation and predicate pushdown, which reduced query time by as much as 700%. These tools also support easy conversion to Pandas dataframes and there is also an Arrow for R package allowing some team members to use R where desired.

The largest operational issue was that the dataset needed to be identical across each developer’s machines. To avoid paying for cloud resources, the data was stored in a shared OneDrive folder and symlinked into the local repository by each user, ensuring uniform access paths while allowing individual developers to manage their data access. This setup allowed each user to download the dataset to their local system on demand, while keeping their copy up-to-date.

Initially, the assumption was that the entire dataset needed to be available to the dashboard, but later it was realized that only the 3-day averages produced as part of the model training were necessary. This allowed the dashboard’s live data to be loaded into memory on initialization. This means that the dashboard does not require high througput IO access.

Exploration and Analysis

Figure 2: An example of Uber’s H3 Cell Grid

After cleaning the data and performing exploratory data analysis, it became evident that a method was needed to handle data containing both geospatial and temporal dimensions. A significant challenge in the analysis was only having positive data. The dataset contained a list of places where fires occurred, but not where they did not occur. This complicates the development of binary classification models for predicting fire probability, as the training data includes only positive instances.

To address these issues, Uber’s H3 spatial indexing system was adopted, as illustrated in Figure 2. H3 divides the globe into hexagonal cells of varying resolutions. Although not as geographically precise as some administrative boundaries, H3 cells offer advantages in computational efficiency and consistent distances between neighboring cells. A grid of H3 cells was applied across the U.S., with each cell covering approximately 4,785 square miles, resulting in a total of 712 cells. For each day in the dataset, it was recorded whether or not a fire occurred within each cell. Weather station data within each hexagon was aggregated to calculate average meteorological values.

Using H3 cells also eliminated the need for repeated spatial joins between fire and weather datasets. Once the cell index was known, the relevant weather stations and fire incidents could be directly connected, significantly reducing computational overhead.

Another challenge involved missing data. If all weather stations within a cell provided null values for a given variable on a specific day, those values were imputed using the average from neighboring cells. Precipitation data was excluded from the analysis entirely, as over 50% of its values were missing.

Modeling

Data Preparation

With the addition of H3 cells and inclusion of negative negative data for cells without a fire, a predictive model was developed to estimate \(Y_ij\), the probability of a fire occurring in cell \(i\) on day \(j\), constrained by \([0,1]\). To construct the predictor variables, weather data was averaged over a three-day window: the two days preceding the fire and the day of the fire itself.
Prediction variables were as follows:

  • Hexagon location
  • Average Elevation
  • Temperature Maximum (3-day average)
  • Temperature Minimum (3-day average)
  • Daily Average Wind Speed (3-day average)
  • Precipitation (3-day average)
  • Snowfall (3-day average)
  • Month

Model Selection

The dataset was divided into training and testing sets, with the training period spanning from January 1, 2000, to December 31, 2019, and the testing period covering January 1, 2020, to March 23, 2025. Hexagonal cell locations were encoded as numeric variables to be used as features in the modeling process. Four different models were trained: logistic regression, random forest, XGBoost, and Naive Bayes. To evaluate model performance, prediction thresholds ranging from 0 to 1 were simulated in 0.01 increments. For each model, the threshold that maximized the F1 score was selected as optimal. The F1 score is defined as: \(\frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}\).

  • Precision measures how often a positive wildfire prediction is correctly positive. It can be defined as \(\frac{\text{TP}}{\text{TP} + \text{FP}}\) where TP represents true positives and FP represents false positives.
  • Recall measures the model’s ability to correctly identify actual wildfire events from all real cases. It is defined as: \(\frac{\text{TP}}{\text{TP} + \text{FN}}\) where FN represents false negatives.

In this context, the F1 score is favored over accuracy due to the imbalanced nature of the dataset, where the occurrence of wildfires is relatively rare. Relying on accuracy would bias the model toward predicting non-fire outcomes, potentially minimizing false positives at the risk of missing true fire events. Since false negatives (failing to predict an actual wildfire) are more consequential than false positives, optimizing for F1 score better balances precision and recall, making it a more appropriate metric for evaluating model performance in wildfire prediction.

Figure 3 shows the optimal thresholds curves, while Table Table 2 displays statistics for each model in the selection process.

Figure 3: Thresholds for the optimal F1 score using each model
Table 2: Statistics for each model fit on historical weather data
Model Optimal.Threshold Accuracy Precision Recall F1.Score
RandomForest 0.22 0.965690 0.141321 0.305115 0.193171
XGBoost 0.18 0.962545 0.134693 0.328602 0.191068
LogisticRegression 0.08 0.940636 0.063594 0.248456 0.101268
GaussianNB 0.65 0.934992 0.074689 0.336228 0.122227

Based on the evaluation results, the random forest model was selected for deployment. Although both XGBoost and random forest performed comparably, the random forest model achieved a higher F1 score and greater overall accuracy. Additionally, familiarity with the random forest algorithm supported its use in this project. However, it is worth noting that the XGBoost model demonstrated higher recall—a valuable trait in this context, given the high cost associated with false negatives. With additional time and resources, further exploration and tuning of the XGBoost model would be a logical next step to potentially enhance detection of true wildfire events.

Hyperparameter Selection

After selecting the random forest as the optimal model, hyperparameter tuning was conducted using a grid search with 3-fold cross-validation. Due to storage constraints, this tuning was performed on a randomly-selected subset of 10% of the data. The grid search explored the following parameter combinations:

  • Number of estimators: 50, 100, 200

  • Maximum tree depth: None, 10, 20, 30

  • Number of features considered at each split: 2, 3

This resulted in 24 total parameter combinations. The model selection was based on the F1 score, evaluated at a decision threshold of 0.22. The best-performing configuration included 200 estimators, a maximum depth of 30, and 3 features considered per split.

The resulting trained model is shown in the following code:

best_rf = RandomForestClassifier(
    random_state=42, 
    n_jobs=-1, 
    n_estimators=best_params['n_estimators'],
    max_depth=best_params['max_depth'],
    max_features=best_params['max_features']
)

best_rf.fit(X_train, y_train)

Results

After training the fine-tuned model on both the training and testing datasets, the final performance metrics are summarized in Table 3. Hyperparameter tuning led to a modest improvement in key metrics, with the F1 score increasing from 0.193 to 0.197 and recall improving from 0.305 to 0.330. While these values remain relatively low, such results are not unexpected given the limitations of the dataset and the inherent unpredictability of wildfire events.

Feature importance analysis, presented in Table 4, highlights that location, elevation, and temperature were among the most influential predictors, underscoring the relevance of geographic and meteorological context in modeling wildfire risk.

Table 3: Model metrics for random forest after hyperparameter tuning
Metric Value
Optimal Threshold 0.210000
Accuracy 0.957575
Precision 0.148936
Recall 0.333333
F1 Score 0.197802
Table 4: Feature importance for random forest model
Feature Importance.Score
Hexagon Location 0.2056
Average Elevation 0.1826
Temperature Maximum (3-Day Average) 0.1685
Temperature Minimum (3-Day Average) 0.1611
Daily Average Wind (3-Day Average) 0.1345
Precipitation (3-Day Average) 0.1003
Month 0.0414
Snowfall (3-Day Average) 0.0059

After running the model, the output was saved as a parquet file and loaded later into the dashboard.

Challenges

One of the primary challenges encountered during model development was the sheer volume of data involved. Since the dataset incorporated both temporal and spatial dimensions, it contained a row for every unique combination of dates from 2020 to 2025 and each of the 712 H3 hexagon locations. This resulted in a total of approximately 188,240,121 rows—far too large to load and process on local devices.

To address this, high-performance workstations at The Catalyst in Parks Library were utilized. These machines, available for reservation by students through the Iowa State Library, provided the enough computing resources to handle the data efficiently. Once the data was processed and the model trained, the outputs were saved and transferred using OneDrive for further use.

Dashboard

Initial Planning

Initially, it was assumed that using a no-code dashboard system would allow greater focus on content development. To this end, ArcGIS Online was evaluated. While it proved easy to use for ad hoc analysis, it struggled to efficiently render map layers due to the size of the dataset. Given these performance limitations, a code-based dashboarding framework was chosen instead. Dash, Streamlit, and Shiny were all evaluated as potential options.

UI/UX Design

The dashboard is organized into three main tabs, each offering distinct functionalities to enhance user experience:

  1. Map View: This tab enables users to select a specific date and visualize wildfire predictions across hexagonal regions on a map of the United States. The date can be adjusted using a slider at the bottom of the page. Additionally, users can choose different variables from the tab menu to display relevant weather data for that particular date, making it easy to analyze weather patterns and areas with higher fire probability. Users can click on any individual hexagon to view both the model output and the corresponding weather data for that specific region and date. The model output can be copied to the clipboard for further analysis and reporting.

  2. Time Series Plot: Created in R using Plotly and hosted as an HTML file, this plot displays two lines: the red line represents the number of fires occurring per week, while the green line indicates the average acres burned during that week. The plot includes buttons in the top-left corner, allowing users to toggle between 6-month and yearly view increments, providing insights into wildfire trends over different time periods.

  3. Dashboard Info: This tab offers details about the underlying model and guidance on how to use the dashboard effectively. Developed in R using Quarto and implemented as an HTML page, this section ensures users understand the dashboard’s features and the model’s workings.

Figure 4: An example wildfire prediction dashboard

Software Stack

When selecting the dashboard framework, the two major options were Python Dash, and R Shiny. Dash was selected because the majority of the ETL code was already written in python.

Python Libraries

  • dash: Framework for builidng data web apps
  • dash-leaflet: A collection of Dash components implementing LeafletJS for interactive mapping
  • shapely: Provides pythonic geospatial data types and operations
  • geopandas: An extension of pandas which supports shapely types and operations.
  • geoArrow: An extension of Arrow which allows for the storage of geospatial data types
  • PyArrow: Python bindings for Apache Arrow, allows efficient processing of larger-than-memory data

Deployment

The application is deployed using a docker container. The source repository includes both a Dockerfile and a docker-compose.yml. This allows for many deployment options. It can be deployed locally using docker compose, or in the cloud using Vercel, Azure Web Apps, AWS Amplify, etc.

Performance Considerations and Challenges

The application was initially designed to query data directly from disk each time the map inputs changed. This approach, resulted in noticeable delays in loading times: more than a 3 active clients would cause significant IO pressure. Because of this, it was decided to plot aggregated data instead of raw data, this reduced the data size to 1/100th of what it was By loading the necessary data into memory (approximately 1.2 GB) during the application’s initialization phase, the system was able to leverage in-memory operations for efficient data management. This had the added benefit on removing the need for making the full dataset available to the app. The necessary parquet files are small enough to be packaged directly into the docker image.

The only major issue was with dash-leaflet: the hexagon tile layer refused to update when the date input was changed, This was due to internal caching behavior which reduces load by only rerendering components if the component’s id changed. This was resolved by including the selected date in the polygon layer’s id, meaning that layers were unique across all dates. This allowed the map to update properly.

Availability/Source Code Location

All of the source code can be found at https://github.com/nathanrethwisch/FinalProjectDS4010A. The dashboard is currently hosted at https://wildfire.ddole.net/.

Conclusion

Summary

Using historical fire data from the United States Forest Service (USFS) and weather data from the National Oceanic and Atmospheric Administration (NOAA), a random forest model and an associated interactive dashboard were developed to predict the probability of wildfires occurring in a specific area on a given date. The dashboard enables users to explore various dates and weather trends that contribute to higher fire probabilities. This tool can be valuable for informing policy decisions and improving the understanding of the relationship between weather patterns and wildfire occurrences.

Further Discussion

Given more time, the analysis could be expanded by incorporating additional datasets to improve predictive accuracy. Key factors such as soil conditions and wind direction in specific areas would be integrated as important predictors. Additionally, exploring the likelihood of a wildfire occurring again in regions that have recently experienced a fire would be a valuable area of focus. With access to greater computational resources, predictions could be further refined by using smaller hexagonal grids, offering more precise insights into distinct geographical climates. Furthermore, the dashboard could be enhanced to support future fire prediction efforts by incorporating upcoming weather forecasts, enabling the presentation of forecasted probabilities of fire occurrence in specific areas. To ensure scalability and improved data access, the datasets should also be migrated to cloud object storage. These ideas serve as a foundation for ongoing research and continued development of the dashboard.

Sources

Kondylatos, Spyros, Ioannis Prapas, Michele Ronco, Ioannis Papoutsis, Gustau Camps-Valls, María Piles, Miguel-Ángel Fernández-Torres, and Nuno Carvalhais. 2022. “Wildfire Danger Prediction and Understanding with Deep Learning.” Geophysical Research Letters 49 (17): e2022GL099368. https://doi.org/https://doi.org/10.1029/2022GL099368.
U.S. Congress Joint Economic Committee. 2023. “Climate-Exacerbated Wildfires Cost the u.s. Between $394 to $893 Billion Each Year in Economic Costs and Damages.” https://www.jec.senate.gov/public/_cache/files/9220abde-7b60-4d05-ba0a-8cc20df44c7d/jec-report-on-total-costs-of-wildfires.pdf.